reader: implement parallel CSV reading #2070

Merged
merged 1 commit into from
Sep 29, 2023

Riolku
Contributor

@Riolku Riolku commented Sep 22, 2023

This also refactors the CSVReader class to enable this change.

@Riolku
Contributor Author

Riolku commented Sep 22, 2023

In testing, LDBC-100 loads in 3 seconds on ac4 (with 128 cores). However, this does not account for the overhead of counting the number of rows. If we are not going to make hash indexes resizable soon, we should add a dedicated CSV row counting function.

Furthermore, the 3 seconds does not account for the overhead of writing to disk, since I ran:

LOAD FROM <csv_path> RETURN COUNT(*)

On LDBC-10, which is much nicer to benchmark because the serial CSV reader loads it quickly enough, I got these numbers:

Serial: ~12s.
Parallel: ~350ms.

Again, in practice, we see only a 2x speedup since we pay the price of counting the rows.
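
For illustration only, here is a minimal sketch of what a dedicated CSV row counting function could look like. It is not code from this PR: the name `countCsvRows` and its signature are hypothetical, and it assumes `\n` record terminators and double-quote quoting where `""` inside a quoted field is an escaped quote. Blank lines are counted as records.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>

// Hypothetical helper, not part of this PR: counts CSV records in a stream.
// Assumption: records end with '\n', and fields may be quoted with '"'.
// A doubled quote ("") toggles the state twice, so it nets out to no change.
std::uint64_t countCsvRows(std::istream& in, bool hasHeader) {
    std::uint64_t rows = 0;
    bool inQuotes = false;        // true while inside a quoted field
    bool sawCharInRecord = false; // true if the current record has any bytes
    char buf[1 << 16];
    // read() fails on a short final read, so also check gcount().
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            const char c = buf[i];
            if (c == '"') {
                inQuotes = !inQuotes;
                sawCharInRecord = true;
            } else if (c == '\n' && !inQuotes) {
                ++rows; // a newline outside quotes ends a record
                sawCharInRecord = false;
            } else {
                sawCharInRecord = true;
            }
        }
    }
    if (sawCharInRecord) {
        ++rows; // last record without a trailing newline
    }
    if (hasHeader && rows > 0) {
        --rows; // do not count the header line
    }
    return rows;
}
```

A scan like this avoids parsing and type conversion entirely, which is the reason a dedicated counter could be cheaper than running the full reader just to learn the row count.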

@Riolku Riolku force-pushed the parallel-csv branch 3 times, most recently from 00500dc to 1a48115 Compare September 23, 2023 02:46
@@ -19,18 +19,21 @@ struct CSVReaderConfig {
char listBeginChar;
char listEndChar;
bool hasHeader;
bool parallel;
Contributor

Not really related to this PR, but I wonder if we should merge CSVReaderConfig with ReaderConfig at some point. Each reader could then access its own subset of fields from the ReaderConfig class.

Contributor Author

Some attributes can be shared. Parallel definitely should be. Others though... don't make sense, right?
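
One possible shape for such a merge, as a rough sketch rather than the project's actual classes: options that apply to every reader sit in one struct, and the CSV-only fields sit in an optional sub-struct that other readers leave empty. The type names `SharedReaderOptions`, `CsvOnlyOptions`, and `MergedReaderConfig`, and all default values, are assumptions made for illustration; only the field names come from the diff above.

```cpp
#include <optional>

// Hypothetical layout, not the project's actual classes.
// Options that make sense for every reader type.
struct SharedReaderOptions {
    bool parallel = true; // default value is an assumption
};

// Options that only make sense for CSV input.
struct CsvOnlyOptions {
    char listBeginChar = '['; // default values are assumptions
    char listEndChar = ']';
    bool hasHeader = false;
};

// A merged config: non-CSV readers leave `csv` empty.
struct MergedReaderConfig {
    SharedReaderOptions shared;
    std::optional<CsvOnlyOptions> csv;
};
```

With this split, a non-CSV reader would only ever touch `shared`, while the CSV reader would require `csv` to be set.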

@Riolku Riolku force-pushed the parallel-csv branch 3 times, most recently from 0846d9a to 18c7e9f Compare September 25, 2023 15:43
@codecov

codecov bot commented Sep 25, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (f05d348) 89.55% compared to head (495e283) 89.60%.
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2070      +/-   ##
==========================================
+ Coverage   89.55%   89.60%   +0.04%     
==========================================
  Files         981      985       +4     
  Lines       35901    35745     -156     
==========================================
- Hits        32151    32028     -123     
+ Misses       3750     3717      -33     
Files Coverage Δ
src/binder/bind/bind_file_scan.cpp 88.05% <100.00%> (+6.80%) ⬆️
src/binder/bind/bind_reading_clause.cpp 97.00% <100.00%> (ø)
src/common/copier_config/copier_config.cpp 54.54% <100.00%> (-3.79%) ⬇️
src/include/common/copier_config/copier_config.h 100.00% <100.00%> (ø)
...r/operator/persistent/reader/csv/base_csv_reader.h 100.00% <100.00%> (ø)
...erator/persistent/reader/csv/parallel_csv_reader.h 100.00% <100.00%> (ø)
...operator/persistent/reader/csv/serial_csv_reader.h 100.00% <100.00%> (ø)
...e/processor/operator/persistent/reader_functions.h 100.00% <100.00%> (ø)
...clude/processor/operator/persistent/reader_state.h 100.00% <ø> (ø)
src/processor/operator/persistent/copy_node.cpp 95.91% <100.00%> (+0.71%) ⬆️
... and 6 more

... and 12 files with indirect coverage changes

@Riolku Riolku marked this pull request as ready for review September 25, 2023 20:38
@Riolku Riolku marked this pull request as draft September 25, 2023 20:39
@Riolku Riolku force-pushed the parallel-csv branch 8 times, most recently from 695fa48 to df729ba Compare September 28, 2023 20:16
@Riolku Riolku marked this pull request as ready for review September 28, 2023 20:42
@Riolku Riolku force-pushed the parallel-csv branch 2 times, most recently from 1cdb9b5 to 33b8ff4 Compare September 28, 2023 21:35
@Riolku Riolku requested a review from ray6080 September 28, 2023 21:53
@ray6080 ray6080 merged commit 53d91db into master Sep 29, 2023
10 of 11 checks passed
@ray6080 ray6080 deleted the parallel-csv branch September 29, 2023 05:55
Riolku added a commit that referenced this pull request Sep 29, 2023
Riolku added a commit that referenced this pull request Sep 29, 2023
@@ -179,9 +179,10 @@ void CopyNode::checkNonNullConstraint(NullColumnChunk* nullChunk, offset_t numNo
}

void CopyNode::finalize(ExecutionContext* context) {
auto numNodes = StorageUtils::getStartOffsetOfNodeGroup(sharedState->getCurNodeGroupIdx()) +
Contributor

Why do we make this change? It looks like a bug to me: it could leave numNodes as 0 in the statistics.

4 participants